{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.6",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "colab": {
      "provenance": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "POWZoSJR6XzK"
      },
      "source": [
        "# Anatomy of a txtai index\n",
        "\n",
        "This notebook inspects the filesystem of a txtai embeddings index and gives an overview of the structure."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qa_PPKVX6XzN"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
        "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
        "trusted": true,
        "_kg_hide-output": true,
        "id": "24q-1n5i6XzQ"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai\n",
        "!apt-get update && apt-get install -y file xxd"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Create index\n",
        "Let's first create an index to inspect. We'll use the classic txtai example.\n"
      ],
      "metadata": {
        "id": "0p3WCDniUths"
      }
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
        "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
        "trusted": true,
        "id": "2j_CFGDR6Xzp",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "4c16f389-2cf0-46d9-9cb8-bdda04d06559"
      },
      "source": [
        "from txtai.embeddings import Embeddings\n",
        "\n",
        "data = [\"US tops 5 million confirmed virus cases\",\n",
        "        \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
        "        \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
        "        \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
        "        \"Maine man wins $1M from $25 lottery ticket\",\n",
        "        \"Make huge profits without work, earn up to $100,000 a day\"]\n",
        "\n",
        "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True, \"objects\": True})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "# Run a search\n",
        "embeddings.search(\"feel good story\", 1)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '4',\n",
              "  'score': 0.08329004049301147,\n",
              "  'text': 'Maine man wins $1M from $25 lottery ticket'}]"
            ]
          },
          "metadata": {},
          "execution_count": 26
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Print index info\n",
        "\n",
        "Embeddings indexes have an `info` method which prints metadata about the index. This can be used to see when the index was built, what settings were used and when it was last updated."
      ],
      "metadata": {
        "id": "pHqeRmHtw1ui"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Print metadata\n",
        "embeddings.info()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "o7nKY0AWxBWU",
        "outputId": "be7eca6e-dbbc-40c5-df1f-9726554de476"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\n",
            "  \"backend\": \"faiss\",\n",
            "  \"build\": {\n",
            "    \"create\": \"2022-03-02T15:18:41Z\",\n",
            "    \"python\": \"3.7.12\",\n",
            "    \"settings\": {\n",
            "      \"components\": \"IDMap,Flat\"\n",
            "    },\n",
            "    \"system\": \"Linux (x86_64)\",\n",
            "    \"txtai\": \"4.3.0\"\n",
            "  },\n",
            "  \"content\": \"sqlite\",\n",
            "  \"dimensions\": 768,\n",
            "  \"objects\": true,\n",
            "  \"offset\": 6,\n",
            "  \"path\": \"sentence-transformers/nli-mpnet-base-v2\",\n",
            "  \"update\": \"2022-03-02T15:18:41Z\"\n",
            "}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Save index and review file structure\n",
        "\n",
        "Next let's save the index and review the file structure. The cell below lists each file and runs commands to show its size, a hex dump of its first bytes and its detected file type."
      ],
      "metadata": {
        "id": "BYWUFBUGyKyY"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Save the index\n",
        "embeddings.save(\"index\")\n",
        "\n",
        "# Show basic details about index files\n",
        "for f in [\"config\", \"documents\", \"embeddings\"]:\n",
        "  !ls -l \"index/{f}\"\n",
        "  !xxd \"index/{f}\" | head -5\n",
        "  !file \"index/{f}\"\n",
        "  !echo\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aPH-dnV2ZuL1",
        "outputId": "6d8d1329-a2e8-4538-b197-0e2959b9eef2"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "-rw-r--r-- 1 root root 295 Mar  2 15:18 index/config\n",
            "00000000: 8004 951c 0100 0000 0000 007d 9428 8c04  ...........}.(..\n",
            "00000010: 7061 7468 948c 2773 656e 7465 6e63 652d  path..'sentence-\n",
            "00000020: 7472 616e 7366 6f72 6d65 7273 2f6e 6c69  transformers/nli\n",
            "00000030: 2d6d 706e 6574 2d62 6173 652d 7632 948c  -mpnet-base-v2..\n",
            "00000040: 0763 6f6e 7465 6e74 948c 0673 716c 6974  .content...sqlit\n",
            "index/config: data\n",
            "\n",
            "-rw-r--r-- 1 root root 28672 Mar  2 15:18 index/documents\n",
            "00000000: 5351 4c69 7465 2066 6f72 6d61 7420 3300  SQLite format 3.\n",
            "00000010: 1000 0101 0040 2020 0000 0001 0000 0007  .....@  ........\n",
            "00000020: 0000 0000 0000 0000 0000 0001 0000 0004  ................\n",
            "00000030: 0000 0000 0000 0000 0000 0001 0000 0000  ................\n",
            "00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................\n",
            "index/documents: SQLite 3.x database, last written using SQLite version 3022000\n",
            "\n",
            "-rw-r--r-- 1 root root 18570 Mar  2 15:18 index/embeddings\n",
            "00000000: 4978 4d70 0003 0000 0600 0000 0000 0000  IxMp............\n",
            "00000010: 0000 1000 0000 0000 0000 1000 0000 0000  ................\n",
            "00000020: 0100 0000 0049 7846 4900 0300 0006 0000  .....IxFI.......\n",
            "00000030: 0000 0000 0000 0010 0000 0000 0000 0010  ................\n",
            "00000040: 0000 0000 0001 0000 0000 0012 0000 0000  ................\n",
            "index/embeddings: data\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The directory has three files: *config*, *documents* and *embeddings*.\n",
        "\n",
        "- config - The input configuration passed into the Embeddings object. Serialized with [Python's pickle format](https://docs.python.org/3/library/pickle.html).\n",
        "\n",
        "- documents - [SQLite](https://www.sqlite.org/index.html) database. Stores the input text content and associated data.\n",
        "\n",
        "- embeddings - The embeddings index file. This is an [Approximate Nearest Neighbor (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximate_nearest_neighbor) index with either [Faiss](https://github.com/facebookresearch/faiss) (default), [Hnswlib](https://github.com/nmslib/hnswlib) or [Annoy](https://github.com/spotify/annoy), depending on the settings."
      ],
      "metadata": {
        "id": "oH4Yd9BOlo5u"
      }
    },
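    {
      "cell_type": "markdown",
      "source": [
        "The hex dumps above hint at how each file could be recognized without the `file` command. The following is a minimal sketch, not part of txtai: the `sniff` and `sniff_file` helpers and their labels are hypothetical, and the byte prefixes are taken from the output shown earlier (pickle protocol 4 starts with `80 04`, SQLite files start with `SQLite format 3` and this Faiss IDMap index starts with `IxMp`). Other backends or txtai versions may write different headers.\n",
        "\n",
        "```python\n",
        "# Hypothetical helper: guess an index file type from its leading bytes.\n",
        "# Prefixes come from the hex dumps above; this is not an official txtai API.\n",
        "SIGNATURES = [\n",
        "    (bytes([0x80, 0x04]), 'pickle (protocol 4)'),\n",
        "    (b'SQLite format 3', 'sqlite database'),\n",
        "    (b'IxMp', 'faiss IDMap index'),\n",
        "]\n",
        "\n",
        "def sniff(header):\n",
        "    # Return the label of the first signature that prefixes the header\n",
        "    for prefix, label in SIGNATURES:\n",
        "        if header.startswith(prefix):\n",
        "            return label\n",
        "    return 'unknown'\n",
        "\n",
        "def sniff_file(path):\n",
        "    # Read only the first 16 bytes, enough for every prefix above\n",
        "    with open(path, 'rb') as f:\n",
        "        return sniff(f.read(16))\n",
        "```\n",
        "\n",
        "With the index saved above, `sniff_file('index/documents')` would be expected to report a SQLite database, matching what `file` printed."
      ],
      "metadata": {}
    },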
    {
      "cell_type": "markdown",
      "source": [
        "# Config\n",
        "\n",
        "Given that the configuration file is serialized with Python pickle, it can be loaded in Python."
      ],
      "metadata": {
        "id": "xO3CokBlzCfc"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import json\n",
        "import pickle\n",
        "\n",
        "with open(\"index/config\", \"rb\") as config:\n",
        "  print(json.dumps(pickle.load(config), sort_keys=True, indent=2))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aNQSCiXHzOTj",
        "outputId": "00b5ebdf-961b-45ac-d90c-e6b824c11979"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\n",
            "  \"backend\": \"faiss\",\n",
            "  \"build\": {\n",
            "    \"create\": \"2022-03-02T15:18:41Z\",\n",
            "    \"python\": \"3.7.12\",\n",
            "    \"settings\": {\n",
            "      \"components\": \"IDMap,Flat\"\n",
            "    },\n",
            "    \"system\": \"Linux (x86_64)\",\n",
            "    \"txtai\": \"4.3.0\"\n",
            "  },\n",
            "  \"content\": \"sqlite\",\n",
            "  \"dimensions\": 768,\n",
            "  \"objects\": true,\n",
            "  \"offset\": 6,\n",
            "  \"path\": \"sentence-transformers/nli-mpnet-base-v2\",\n",
            "  \"update\": \"2022-03-02T15:18:41Z\"\n",
            "}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Notice how this is the same output as `embeddings.info()`."
      ],
      "metadata": {
        "id": "_LJvaPzFzqId"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Documents\n",
        "\n",
        "The documents file is a SQLite database with three tables: `documents`, `objects` and `sections`. Let's take a look inside."
      ],
      "metadata": {
        "id": "i5_m92oSz3eK"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "import sqlite3\n",
        "\n",
        "from IPython.display import display, Markdown\n",
        "\n",
        "# Print details of a txtai SQLite document database\n",
        "def showdb(path):\n",
        "  db = sqlite3.connect(path)\n",
        "\n",
        "  display(Markdown(\"## Tables\"))\n",
        "  df = pd.read_sql_query(\"select name FROM sqlite_master where type='table'\", db)\n",
        "  display(df.style.hide_index())\n",
        "\n",
        "  for table in df[\"name\"]:\n",
        "    display(Markdown(f\"## {table}\"))\n",
        "    df = pd.read_sql_query(f\"select * from {table}\", db)\n",
        "\n",
        "    # Truncate large binary objects\n",
        "    if \"object\" in df:\n",
        "      df[\"object\"] = df[\"object\"].str.slice(0, 25)\n",
        "\n",
        "    display(df.style.hide_index())\n",
        "\n",
        "showdb(\"index/documents\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 619
        },
        "id": "32TmOeRZ0Lec",
        "outputId": "895b569c-3509-4f38-c4eb-36340d718d15"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/markdown": "## Tables",
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<style type=\"text/css\">\n",
              "</style>\n",
              "<table id=\"T_77cb7_\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr>\n",
              "      <th class=\"col_heading level0 col0\" >name</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td id=\"T_77cb7_row0_col0\" class=\"data row0 col0\" >documents</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_77cb7_row1_col0\" class=\"data row1 col0\" >objects</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_77cb7_row2_col0\" class=\"data row2 col0\" >sections</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n"
            ],
            "text/plain": [
              "<pandas.io.formats.style.Styler at 0x7f686163de90>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/markdown": "## documents",
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<style type=\"text/css\">\n",
              "</style>\n",
              "<table id=\"T_71e4b_\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr>\n",
              "      <th class=\"col_heading level0 col0\" >id</th>\n",
              "      <th class=\"col_heading level0 col1\" >data</th>\n",
              "      <th class=\"col_heading level0 col2\" >tags</th>\n",
              "      <th class=\"col_heading level0 col3\" >entry</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "  </tbody>\n",
              "</table>\n"
            ],
            "text/plain": [
              "<pandas.io.formats.style.Styler at 0x7f686163e850>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/markdown": "## objects",
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<style type=\"text/css\">\n",
              "</style>\n",
              "<table id=\"T_826d2_\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr>\n",
              "      <th class=\"col_heading level0 col0\" >id</th>\n",
              "      <th class=\"col_heading level0 col1\" >object</th>\n",
              "      <th class=\"col_heading level0 col2\" >tags</th>\n",
              "      <th class=\"col_heading level0 col3\" >entry</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "  </tbody>\n",
              "</table>\n"
            ],
            "text/plain": [
              "<pandas.io.formats.style.Styler at 0x7f686163e850>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/markdown": "## sections",
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<style type=\"text/css\">\n",
              "</style>\n",
              "<table id=\"T_ca47c_\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr>\n",
              "      <th class=\"col_heading level0 col0\" >indexid</th>\n",
              "      <th class=\"col_heading level0 col1\" >id</th>\n",
              "      <th class=\"col_heading level0 col2\" >text</th>\n",
              "      <th class=\"col_heading level0 col3\" >tags</th>\n",
              "      <th class=\"col_heading level0 col4\" >entry</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td id=\"T_ca47c_row0_col0\" class=\"data row0 col0\" >0</td>\n",
              "      <td id=\"T_ca47c_row0_col1\" class=\"data row0 col1\" >0</td>\n",
              "      <td id=\"T_ca47c_row0_col2\" class=\"data row0 col2\" >US tops 5 million confirmed virus cases</td>\n",
              "      <td id=\"T_ca47c_row0_col3\" class=\"data row0 col3\" >None</td>\n",
              "      <td id=\"T_ca47c_row0_col4\" class=\"data row0 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_ca47c_row1_col0\" class=\"data row1 col0\" >1</td>\n",
              "      <td id=\"T_ca47c_row1_col1\" class=\"data row1 col1\" >1</td>\n",
              "      <td id=\"T_ca47c_row1_col2\" class=\"data row1 col2\" >Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg</td>\n",
              "      <td id=\"T_ca47c_row1_col3\" class=\"data row1 col3\" >None</td>\n",
              "      <td id=\"T_ca47c_row1_col4\" class=\"data row1 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_ca47c_row2_col0\" class=\"data row2 col0\" >2</td>\n",
              "      <td id=\"T_ca47c_row2_col1\" class=\"data row2 col1\" >2</td>\n",
              "      <td id=\"T_ca47c_row2_col2\" class=\"data row2 col2\" >Beijing mobilises invasion craft along coast as Taiwan tensions escalate</td>\n",
              "      <td id=\"T_ca47c_row2_col3\" class=\"data row2 col3\" >None</td>\n",
              "      <td id=\"T_ca47c_row2_col4\" class=\"data row2 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_ca47c_row3_col0\" class=\"data row3 col0\" >3</td>\n",
              "      <td id=\"T_ca47c_row3_col1\" class=\"data row3 col1\" >3</td>\n",
              "      <td id=\"T_ca47c_row3_col2\" class=\"data row3 col2\" >The National Park Service warns against sacrificing slower friends in a bear attack</td>\n",
              "      <td id=\"T_ca47c_row3_col3\" class=\"data row3 col3\" >None</td>\n",
              "      <td id=\"T_ca47c_row3_col4\" class=\"data row3 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_ca47c_row4_col0\" class=\"data row4 col0\" >4</td>\n",
              "      <td id=\"T_ca47c_row4_col1\" class=\"data row4 col1\" >4</td>\n",
              "      <td id=\"T_ca47c_row4_col2\" class=\"data row4 col2\" >Maine man wins $1M from $25 lottery ticket</td>\n",
              "      <td id=\"T_ca47c_row4_col3\" class=\"data row4 col3\" >None</td>\n",
              "      <td id=\"T_ca47c_row4_col4\" class=\"data row4 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_ca47c_row5_col0\" class=\"data row5 col0\" >5</td>\n",
              "      <td id=\"T_ca47c_row5_col1\" class=\"data row5 col1\" >5</td>\n",
              "      <td id=\"T_ca47c_row5_col2\" class=\"data row5 col2\" >Make huge profits without work, earn up to $100,000 a day</td>\n",
              "      <td id=\"T_ca47c_row5_col3\" class=\"data row5 col3\" >None</td>\n",
              "      <td id=\"T_ca47c_row5_col4\" class=\"data row5 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n"
            ],
            "text/plain": [
              "<pandas.io.formats.style.Styler at 0x7f68631d1510>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "`documents` stores additional text fields as JSON, `objects` stores binary content and `sections` stores indexed text. So far, the only table with data is `sections`, which holds the input (id, text, tags) elements along with internal index ids and entry dates.\n",
        "\n",
        "We'll come back to `documents` and `objects`."
      ],
      "metadata": {
        "id": "-nmu31TQ4gSv"
      }
    },
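    {
      "cell_type": "markdown",
      "source": [
        "The `sections` table can also be read with plain `sqlite3`, without pandas. The sketch below is illustrative only: it builds an in-memory table whose column names mirror the output above, while the column types, the two sample `rows` and the `lookup` helper are assumptions rather than txtai's actual schema or API.\n",
        "\n",
        "```python\n",
        "import sqlite3\n",
        "\n",
        "# In-memory stand-in for index/documents; column types are assumed\n",
        "db = sqlite3.connect(':memory:')\n",
        "db.execute('CREATE TABLE sections (indexid INTEGER PRIMARY KEY, id TEXT, text TEXT, tags TEXT, entry DATETIME)')\n",
        "\n",
        "# Two sample rows copied from the table shown above\n",
        "rows = [\n",
        "    (0, '0', 'US tops 5 million confirmed virus cases', None, '2022-03-02 15:18:40.591760'),\n",
        "    (4, '4', 'Maine man wins $1M from $25 lottery ticket', None, '2022-03-02 15:18:40.591760'),\n",
        "]\n",
        "db.executemany('INSERT INTO sections VALUES (?, ?, ?, ?, ?)', rows)\n",
        "\n",
        "def lookup(db, indexid):\n",
        "    # Map an internal index id back to the (id, text) pair, or None if missing\n",
        "    query = 'SELECT id, text FROM sections WHERE indexid = ?'\n",
        "    return db.execute(query, (indexid,)).fetchone()\n",
        "\n",
        "print(lookup(db, 4))\n",
        "```\n",
        "\n",
        "Pointing `sqlite3.connect` at `index/documents` instead of `:memory:` should allow the same kind of lookup against the saved index, assuming the table layout matches the output above."
      ],
      "metadata": {}
    },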
    {
      "cell_type": "markdown",
      "source": [
        "# Embeddings\n",
        "\n",
        "The embeddings file is the ANN index, which is what gets queried when running a similarity search. The default setting is to use Faiss. Let's inspect!"
      ],
      "metadata": {
        "id": "v3SsQCCD7lR7"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import faiss\n",
        "import numpy as np\n",
        "\n",
        "# Query\n",
        "query = \"feel good story\"\n",
        "\n",
        "# Read index\n",
        "index = faiss.read_index(\"index/embeddings\")\n",
        "print(index)\n",
        "print(f\"Total records: {index.ntotal}, dimensions: {index.d}\")\n",
        "print()\n",
        "\n",
        "# Generate query embeddings and run query\n",
        "queries = np.array([embeddings.transform((None, query, None))])\n",
        "scores, ids = index.search(queries, 1)\n",
        "\n",
        "# Look up query result from original data array (ids match list positions here)\n",
        "result = data[ids[0][0]]\n",
        "\n",
        "# Show results\n",
        "print(\"Query:\", query)\n",
        "print(\"Results:\", result, ids, scores)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ofIHY-pV7kWH",
        "outputId": "f990cc01-e235-4010-ccfd-fdbb5692cabe"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "<faiss.swigfaiss.IndexIDMap; proxy of <Swig Object of type 'faiss::IndexIDMapTemplate< faiss::Index > *' at 0x7f68631cd750> >\n",
            "Total records: 6, dimensions: 768\n",
            "\n",
            "Query: feel good story\n",
            "Results: Maine man wins $1M from $25 lottery ticket [[4]] [[0.08329004]]\n"
          ]
        }
      ]
    },
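    {
      "cell_type": "markdown",
      "source": [
        "The `IDMap,Flat` components listed in the config describe an index that stores vectors uncompressed, compares the query against every one of them and attaches an id to each vector. As a rough mental model, and assuming scores are inner products between L2-normalized vectors (an assumption about the scoring, not something shown in the output above), the search behaves like the pure-Python sketch below. The `normalize` and `search` helpers and the toy 2-dimensional vectors are made up for illustration.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "def normalize(vector):\n",
        "    # Scale to unit length so the inner product equals cosine similarity\n",
        "    norm = math.sqrt(sum(x * x for x in vector))\n",
        "    return [x / norm for x in vector]\n",
        "\n",
        "def search(query, vectors, limit):\n",
        "    # vectors: dict of id -> vector, playing the role of the IDMap layer\n",
        "    query = normalize(query)\n",
        "    scores = []\n",
        "    for uid, vector in vectors.items():\n",
        "        score = sum(a * b for a, b in zip(query, normalize(vector)))\n",
        "        scores.append((uid, score))\n",
        "\n",
        "    # Flat search: score every vector, then keep the best matches\n",
        "    return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]\n",
        "\n",
        "vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}\n",
        "print(search([2.0, 0.1], vectors, 1))\n",
        "```\n",
        "\n",
        "Scoring every vector is exact but grows linearly with the number of records, which is fine for six records and is the reason approximate index types exist for larger collections."
      ],
      "metadata": {}
    },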
    {
      "cell_type": "markdown",
      "source": [
        "# Index compression\n",
        "\n",
        "txtai normally saves index files to a directory. Indexes can also be compressed. Nothing is different other than the files being stored in a compressed archive instead of a directory."
      ],
      "metadata": {
        "id": "s9aLt2zF2ZW2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Save index as tar.xz\n",
        "embeddings.save(\"index.tar.xz\")\n",
        "!tar -tvJf index.tar.xz\n",
        "!echo\n",
        "!xz -l index.tar.xz\n",
        "!echo\n",
        "\n",
        "# Reload index\n",
        "embeddings.load(\"index.tar.xz\")\n",
        "\n",
        "# Test search matches\n",
        "embeddings.search(\"feel good story\", 1)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "0oOC8ToG1pyn",
        "outputId": "6fa8a8a7-3831-4307-a818-a4b62f8a81e8"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "drwx------ root/root         0 2022-03-02 15:18 ./\n",
            "-rw-r--r-- root/root       295 2022-03-02 15:18 ./config\n",
            "-rw-r--r-- root/root     28672 2022-03-02 15:18 ./documents\n",
            "-rw-r--r-- root/root     18570 2022-03-02 15:18 ./embeddings\n",
            "\n",
            "Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename\n",
            "    1       1     18.1 KiB     50.0 KiB  0.361  CRC64   index.tar.xz\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '4',\n",
              "  'score': 0.08329004049301147,\n",
              "  'text': 'Maine man wins $1M from $25 lottery ticket'}]"
            ]
          },
          "metadata": {},
          "execution_count": 32
        }
      ]
    },
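    {
      "cell_type": "markdown",
      "source": [
        "Since the compressed index is a regular tar archive with xz compression, as the `tar -tvJf` listing shows, it can also be inspected from Python with the standard `tarfile` module. The sketch below is self-contained and does not touch the real index: it packs three placeholder files with the same names into a temporary archive and lists them. The placeholder contents and the `pack` helper are invented for illustration.\n",
        "\n",
        "```python\n",
        "import os\n",
        "import tarfile\n",
        "import tempfile\n",
        "\n",
        "def pack(directory, archive):\n",
        "    # Write the directory contents to a tar.xz archive, rooted at '.'\n",
        "    with tarfile.open(archive, 'w:xz') as tar:\n",
        "        tar.add(directory, arcname='.')\n",
        "\n",
        "with tempfile.TemporaryDirectory() as tmp:\n",
        "    # Placeholder files standing in for a saved index directory\n",
        "    source = os.path.join(tmp, 'index')\n",
        "    os.makedirs(source)\n",
        "    for name in ['config', 'documents', 'embeddings']:\n",
        "        with open(os.path.join(source, name), 'wb') as f:\n",
        "            f.write(b'placeholder')\n",
        "\n",
        "    archive = os.path.join(tmp, 'index.tar.xz')\n",
        "    pack(source, archive)\n",
        "\n",
        "    # Collect file member names and sizes, similar to tar -tv\n",
        "    with tarfile.open(archive, 'r:xz') as tar:\n",
        "        members = sorted((m.name, m.size) for m in tar.getmembers() if m.isfile())\n",
        "\n",
        "print(members)\n",
        "```\n",
        "\n",
        "Opening the real `index.tar.xz` with `tarfile.open(path, 'r:xz')` should list the same three member names as the `tar` output above."
      ],
      "metadata": {}
    },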
    {
      "cell_type": "markdown",
      "source": [
        "# Content storage\n",
        "\n",
        "Let's add additional metadata and binary content to the index and see how that is stored in the SQLite database."
      ],
      "metadata": {
        "id": "lGmiYXyqyjtQ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import urllib.request\n",
        "\n",
        "from IPython.display import Image\n",
        "\n",
        "# Get an image\n",
        "request = urllib.request.urlopen(\"https://raw.githubusercontent.com/neuml/txtai/master/demo.gif\")\n",
        "\n",
        "# Get data\n",
        "data = request.read()\n",
        "\n",
        "# Upsert new record having both text and an object\n",
        "embeddings.upsert([(\"txtai\", {\"text\": \"txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.\", \"size\": len(data), \"object\": data}, None)])\n",
        "\n",
        "embeddings.save(\"index\")\n",
        "showdb(\"index/documents\")"
      ],
      "metadata": {
        "id": "Ef4-Gd8ZtzUF",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 713
        },
        "outputId": "0f290fdc-2bb7-4022-e4a0-1dc54b080bc5"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/markdown": "## Tables",
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<style type=\"text/css\">\n",
              "</style>\n",
              "<table id=\"T_116f6_\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr>\n",
              "      <th class=\"col_heading level0 col0\" >name</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td id=\"T_116f6_row0_col0\" class=\"data row0 col0\" >documents</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_116f6_row1_col0\" class=\"data row1 col0\" >objects</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_116f6_row2_col0\" class=\"data row2 col0\" >sections</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n"
            ],
            "text/plain": [
              "<pandas.io.formats.style.Styler at 0x7f68632cf7d0>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/markdown": "## documents",
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<style type=\"text/css\">\n",
              "</style>\n",
              "<table id=\"T_c2eee_\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr>\n",
              "      <th class=\"col_heading level0 col0\" >id</th>\n",
              "      <th class=\"col_heading level0 col1\" >data</th>\n",
              "      <th class=\"col_heading level0 col2\" >tags</th>\n",
              "      <th class=\"col_heading level0 col3\" >entry</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td id=\"T_c2eee_row0_col0\" class=\"data row0 col0\" >txtai</td>\n",
              "      <td id=\"T_c2eee_row0_col1\" class=\"data row0 col1\" >{\"text\": \"txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.\", \"size\": 47189}</td>\n",
              "      <td id=\"T_c2eee_row0_col2\" class=\"data row0 col2\" >None</td>\n",
              "      <td id=\"T_c2eee_row0_col3\" class=\"data row0 col3\" >2022-03-02 15:19:00.708223</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n"
            ],
            "text/plain": [
              "<pandas.io.formats.style.Styler at 0x7f6861966890>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/markdown": "## objects",
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<style type=\"text/css\">\n",
              "</style>\n",
              "<table id=\"T_683a5_\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr>\n",
              "      <th class=\"col_heading level0 col0\" >id</th>\n",
              "      <th class=\"col_heading level0 col1\" >object</th>\n",
              "      <th class=\"col_heading level0 col2\" >tags</th>\n",
              "      <th class=\"col_heading level0 col3\" >entry</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td id=\"T_683a5_row0_col0\" class=\"data row0 col0\" >txtai</td>\n",
              "      <td id=\"T_683a5_row0_col1\" class=\"data row0 col1\" >b'GIF89a\\x9b\\x04\\x18\\x03\\xf5\\x00\\x00\\x12\\x13\\x14\\xcc\\xcc\\xcc\\x13\\x14\\x15\\xbd\\xbd\\xbd'</td>\n",
              "      <td id=\"T_683a5_row0_col2\" class=\"data row0 col2\" >None</td>\n",
              "      <td id=\"T_683a5_row0_col3\" class=\"data row0 col3\" >2022-03-02 15:19:00.708223</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n"
            ],
            "text/plain": [
              "<pandas.io.formats.style.Styler at 0x7f6861966890>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/markdown": "## sections",
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<style type=\"text/css\">\n",
              "</style>\n",
              "<table id=\"T_74f8d_\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr>\n",
              "      <th class=\"col_heading level0 col0\" >indexid</th>\n",
              "      <th class=\"col_heading level0 col1\" >id</th>\n",
              "      <th class=\"col_heading level0 col2\" >text</th>\n",
              "      <th class=\"col_heading level0 col3\" >tags</th>\n",
              "      <th class=\"col_heading level0 col4\" >entry</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td id=\"T_74f8d_row0_col0\" class=\"data row0 col0\" >0</td>\n",
              "      <td id=\"T_74f8d_row0_col1\" class=\"data row0 col1\" >0</td>\n",
              "      <td id=\"T_74f8d_row0_col2\" class=\"data row0 col2\" >US tops 5 million confirmed virus cases</td>\n",
              "      <td id=\"T_74f8d_row0_col3\" class=\"data row0 col3\" >None</td>\n",
              "      <td id=\"T_74f8d_row0_col4\" class=\"data row0 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_74f8d_row1_col0\" class=\"data row1 col0\" >1</td>\n",
              "      <td id=\"T_74f8d_row1_col1\" class=\"data row1 col1\" >1</td>\n",
              "      <td id=\"T_74f8d_row1_col2\" class=\"data row1 col2\" >Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg</td>\n",
              "      <td id=\"T_74f8d_row1_col3\" class=\"data row1 col3\" >None</td>\n",
              "      <td id=\"T_74f8d_row1_col4\" class=\"data row1 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_74f8d_row2_col0\" class=\"data row2 col0\" >2</td>\n",
              "      <td id=\"T_74f8d_row2_col1\" class=\"data row2 col1\" >2</td>\n",
              "      <td id=\"T_74f8d_row2_col2\" class=\"data row2 col2\" >Beijing mobilises invasion craft along coast as Taiwan tensions escalate</td>\n",
              "      <td id=\"T_74f8d_row2_col3\" class=\"data row2 col3\" >None</td>\n",
              "      <td id=\"T_74f8d_row2_col4\" class=\"data row2 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_74f8d_row3_col0\" class=\"data row3 col0\" >3</td>\n",
              "      <td id=\"T_74f8d_row3_col1\" class=\"data row3 col1\" >3</td>\n",
              "      <td id=\"T_74f8d_row3_col2\" class=\"data row3 col2\" >The National Park Service warns against sacrificing slower friends in a bear attack</td>\n",
              "      <td id=\"T_74f8d_row3_col3\" class=\"data row3 col3\" >None</td>\n",
              "      <td id=\"T_74f8d_row3_col4\" class=\"data row3 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_74f8d_row4_col0\" class=\"data row4 col0\" >4</td>\n",
              "      <td id=\"T_74f8d_row4_col1\" class=\"data row4 col1\" >4</td>\n",
              "      <td id=\"T_74f8d_row4_col2\" class=\"data row4 col2\" >Maine man wins $1M from $25 lottery ticket</td>\n",
              "      <td id=\"T_74f8d_row4_col3\" class=\"data row4 col3\" >None</td>\n",
              "      <td id=\"T_74f8d_row4_col4\" class=\"data row4 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_74f8d_row5_col0\" class=\"data row5 col0\" >5</td>\n",
              "      <td id=\"T_74f8d_row5_col1\" class=\"data row5 col1\" >5</td>\n",
              "      <td id=\"T_74f8d_row5_col2\" class=\"data row5 col2\" >Make huge profits without work, earn up to $100,000 a day</td>\n",
              "      <td id=\"T_74f8d_row5_col3\" class=\"data row5 col3\" >None</td>\n",
              "      <td id=\"T_74f8d_row5_col4\" class=\"data row5 col4\" >2022-03-02 15:18:40.591760</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td id=\"T_74f8d_row6_col0\" class=\"data row6 col0\" >6</td>\n",
              "      <td id=\"T_74f8d_row6_col1\" class=\"data row6 col1\" >txtai</td>\n",
              "      <td id=\"T_74f8d_row6_col2\" class=\"data row6 col2\" >txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.</td>\n",
              "      <td id=\"T_74f8d_row6_col3\" class=\"data row6 col3\" >None</td>\n",
              "      <td id=\"T_74f8d_row6_col4\" class=\"data row6 col4\" >2022-03-02 15:19:00.708223</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n"
            ],
            "text/plain": [
              "<pandas.io.formats.style.Styler at 0x7f686319be50>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "This section added a new record with metadata and binary content (the binary content is truncated when printed here). The `documents` table stores the metadata as JSON, which enables fielded search with SQL, such as filtering on `size`."
      ],
      "metadata": {
        "id": "gcgtUQnACf5c"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"select * from txtai where size > 0\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lz7xwroECzx2",
        "outputId": "3740cb3b-5904-453e-af93-5ee98c14652d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'data': '{\"text\": \"txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.\", \"size\": 47189}',\n",
              "  'entry': '2022-03-02 15:19:00.708223',\n",
              "  'id': 'txtai',\n",
              "  'indexid': 6,\n",
              "  'object': <_io.BytesIO at 0x7f6861408a70>,\n",
              "  'score': None,\n",
              "  'tags': None,\n",
              "  'text': 'txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.'}]"
            ]
          },
          "metadata": {},
          "execution_count": 34
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Metadata fields can also be selected and used as filters alongside similarity queries."
      ],
      "metadata": {
        "id": "9fOzYXY6DJFj"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"select text, size, score from txtai where similar('machine learning') and score > 0.25 and size > 0\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "DXWf90-UDM0H",
        "outputId": "7c31c4ea-5e2d-4873-d9cf-d9b7e6196754"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'score': 0.5479326844215393,\n",
              "  'size': 47189,\n",
              "  'text': 'txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.'}]"
            ]
          },
          "metadata": {},
          "execution_count": 35
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The `objects` table stores binary content alongside an embeddings index. In some cases, such as image search, the embeddings are built from the object content.\n",
        "\n",
        "Otherwise, the embeddings are built from the `text` field in `sections`. In both cases, the associated binary objects are available at search time."
      ],
      "metadata": {
        "id": "XvBaEBCDIUN6"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"select object from txtai where object is not null\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "RaJPqDV3I3sm",
        "outputId": "3c416f6f-2ca6-481b-dc53-193e89f7da3e"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'object': <_io.BytesIO at 0x7f6863246470>}]"
            ]
          },
          "metadata": {},
          "execution_count": 36
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aDIF3tYt6X0O"
      },
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook gave an overview of the txtai embeddings index file format. It should provide a basic understanding of the architecture and help with debugging when issues come up.\n",
        "\n",
        "See the following links for more information.\n",
        "\n",
        "- [GitHub](https://github.com/neuml/txtai)\n",
        "- [Embeddings documentation](https://neuml.github.io/txtai/embeddings)"
      ]
    }
  ]
}